From a known distribution, $F(b)$, $n$ samples are drawn, and are ordered as $b_1 \le b_2 \le \cdots \le b_n$ according to the value of $b$ ($0 \le b \le \infty$). For example, the $b$'s might be family names, drawn from some known ethnic distribution, and arranged alphabetically for filing purposes. These ordered samples are to be divided into groups, and stored in filing drawers (or blocks of computer storage), with space left in each drawer for additions to the file. If many additions are made to the file, of course, there is the possibility that a drawer will overflow, and that the filing groups will have to be reorganized. Given a fixed initial sample size and common capacity of each drawer, the best system design will group the names so that the overflow probability is the same for each drawer.

In this paper, we shall indicate how to perform this initial a priori grouping of the files.

Suppose that a certain drawer begins with the $t$th sample, whose value is $b_t$, and ends with the $(t+d)$th sample, whose value is $b_{t+d}$. Now suppose that $N$ new independent samples are drawn from the same distribution, and added to the files in correct order. The probability that $x$ of the new samples will be placed in this particular drawer (i.e., that $x$ of their values lie between $b_t$ and $b_{t+d}$) is just:

$$\binom{N}{x}\, p^{x} (1-p)^{N-x}, \qquad 0 \le t < t+d \le n+1, \quad x = 0, 1, \ldots, N, \tag{1}$$

where $p = F(b_{t+d}) - F(b_t)$ is the probability that a single new sample would fall in this drawer, and where for convenience we have defined $b_0 = 0$ and $b_{n+1} = \infty$.

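Usually, however, when a filing system is planned, it is not possible to know in advance the values of the first and last names in a drawer, but only their file positions, the indices $t$ and $t+d$. Thus, one is interested in the probability of insertion in a given drawer, averaged over all possible values, $b_t$ and $b_{t+d}$, of the positions $t$ and $t+d$. It is known* that the joint distribution density of the $t$th and $(t+d)$th ordered samples is:

$$g(b_t, b_{t+d}) = \frac{n!}{(t-1)!\,(d-1)!\,(n-t-d)!}\,[F(b_t)]^{t-1}\,[F(b_{t+d}) - F(b_t)]^{d-1}\,[1 - F(b_{t+d})]^{n-t-d}\,F'(b_t)\,F'(b_{t+d}), \qquad b_t \le b_{t+d}, \tag{2}$$

so that the averaged distribution of new samples lying between the $t$th and $(t+d)$th positions of the old samples is just:

$$w_x \equiv w_x(N,n;t,t+d) = \int\!\!\int db_t\, db_{t+d}\; g(b_t, b_{t+d})\, \binom{N}{x}\, p^{x} (1-p)^{N-x}, \tag{3}$$

where the integration runs over $0 \le b_t \le b_{t+d} \le \infty$ and $p = F(b_{t+d}) - F(b_t)$ as in (1).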

A first change of variable, $z = 1 - F(b_t)/F(b_{t+d})$, with respect to $b_t$; then a second change of variable, $y = F(b_{t+d})$; and we are led to the simplified form:

$$w_x = \binom{N}{x}\, K \int_0^1 dy\; y^{x+t+d-1} (1-y)^{n-t-d} \int_0^1 dz\; z^{x+d-1} (1-z)^{t-1} (1-yz)^{N-x}, \tag{4}$$

with $K = n!\,[(t-1)!\,(d-1)!\,(n-t-d)!]^{-1}$.
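The substitution behind (4) can be spelled out as follows. This is only a sketch of the intermediate algebra; the shorthand $u = F(b_t)$ and $v = F(b_{t+d})$ is introduced here for illustration and is not part of the original notation.

```latex
% Shorthand assumed for this sketch only: u = F(b_t), v = F(b_{t+d}), 0 <= u <= v <= 1.
% With z = 1 - u/v and y = v, both y and z range over [0, 1].
\begin{align*}
u &= y(1-z), \qquad v = y, \qquad p = v - u = yz, \qquad du\,dv = y\,dz\,dy, \\
u^{t-1} &= y^{t-1}(1-z)^{t-1}, \qquad (v-u)^{d-1} = y^{d-1} z^{d-1}, \qquad (1-v)^{n-t-d} = (1-y)^{n-t-d}, \\
p^{x}(1-p)^{N-x} &= y^{x} z^{x} (1-yz)^{N-x}.
\end{align*}
% Collecting powers of y: (t-1) + (d-1) + x + 1 = x+t+d-1.
% Collecting powers of z: (d-1) + x = x+d-1.
```

The extra power of $y$ comes from the Jacobian of the substitution, which is why the $y$ exponent in (4) is $x+t+d-1$ rather than $x+t+d-2$.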

* See, for example, Gumbel, E. J., Statistics of Extremes, Columbia University Press, New York (1958), p. 55.

Upon expanding $(1-yz)^{N-x}$ with the binomial theorem, integrating the separated integrals using the definition of Beta functions, and summing the resulting series through the formula:

$$\sum_{k=0}^{m} (-1)^{k} \binom{m}{k} \frac{(a+k)!}{(a+b+k)!} = \frac{a!\,(b+m-1)!}{(a+b+m)!\,(b-1)!}, \tag{5}$$


one finds after some algebraic manipulation that:

$$w_x(N,n;t,t+d) = \frac{\dbinom{N+n-d-x}{n-d} \dbinom{d+x-1}{d-1}}{\dbinom{N+n}{n}}, \qquad x = 0, 1, \ldots, N, \tag{6}$$


the so-called distribution of exceedances, whose properties are known*. For example, the mean number of names (out of the $N$ new samples) which lie between the $t$th and the $(t+d)$th positions (of the original $n$ ordered samples) is:

$$\bar{x} = \frac{dN}{n+1}, \tag{7}$$

with a variance:

$$\sigma^2 = \frac{N d\,(n-d+1)\,(N+n+1)}{(n+1)^2\,(n+2)}. \tag{8}$$
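The distribution (6) and its moments (7) and (8) can be checked numerically. The following is a minimal sketch in exact rational arithmetic; the function names and the parameter values are illustrative assumptions, not part of the original paper, and the code assumes $1 \le d \le n$.

```python
from fractions import Fraction
from math import comb


def exceedance_pmf(x: int, N: int, n: int, d: int) -> Fraction:
    """Probability (6) that x of N new samples fall between two old order
    statistics that are d positions apart (n old samples, 1 <= d <= n)."""
    return Fraction(comb(N + n - d - x, n - d) * comb(d + x - 1, d - 1),
                    comb(N + n, n))


def moments(N: int, n: int, d: int) -> tuple[Fraction, Fraction]:
    """Mean and variance of (6), obtained by direct summation over x."""
    probs = [exceedance_pmf(x, N, n, d) for x in range(N + 1)]
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return mean, var


N, n, d = 12, 20, 5  # illustrative values only
total = sum(exceedance_pmf(x, N, n, d) for x in range(N + 1))
mean, var = moments(N, n, d)
print(total, mean, var)
```

Because the index $t$ does not enter `exceedance_pmf` at all, the sketch also makes visible that (6) depends on the drawer only through $d$.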


There are several interesting properties of the distribution (6). First, the probability that $x$ of the new names fall in a given drawer depends only on the number $d$ of original samples defining the drawer, and not upon the index $t$ which describes the position of the first name of the drawer in the original sample. In other words:

$$w_x(N,n;0,d) = w_x(N,n;1,d+1) = \cdots = w_x(N,n;t,t+d) = \cdots = w_x(N,n;n-d+1,n+1). \tag{9}$$


Secondly, we note that the averaged distribution is independent of the sampling distribution; roughly speaking, this is because on the average the new samples tend to "fill out" the sampling distribution in the same way that the original samples did.
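Both properties can be illustrated with a small Monte Carlo experiment. The sketch below is an assumption-laden illustration rather than anything from the original paper: the helper names, the two sampling distributions (exponential and uniform), the trial count, and the seed are all hypothetical choices. It estimates the mean number of new samples falling in a drawer for several starting indices $t$ and compares it with $dN/(n+1)$.

```python
import random


def count_in_drawer(draw, n, N, t, d, rng):
    """One trial: draw n old samples, then count how many of N new samples
    fall between the t-th and (t+d)-th ordered old samples."""
    old = sorted(draw(rng) for _ in range(n))
    bounds = [0.0] + old + [float("inf")]  # b_0 = 0 and b_{n+1} = infinity
    lo, hi = bounds[t], bounds[t + d]
    return sum(lo <= draw(rng) < hi for _ in range(N))


def mean_count(draw, n, N, t, d, trials=20000, seed=1):
    """Average of count_in_drawer over many independent trials."""
    rng = random.Random(seed)
    return sum(count_in_drawer(draw, n, N, t, d, rng) for _ in range(trials)) / trials


def exponential(rng):
    return rng.expovariate(1.0)


def uniform(rng):
    return rng.random()


n, N, d = 9, 5, 3  # illustrative values only
estimates = [mean_count(draw, n, N, t, d)
             for draw in (exponential, uniform) for t in (0, 3, 7)]
print(estimates, d * N / (n + 1))
```

All six estimates should cluster around the same value regardless of the distribution or of $t$, up to sampling noise.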


* op. cit., pp. 58-65.
For only one new sample ($N = 1$), the averaged distribution (6) reduces to the binomial distribution obtained from (1) by substituting the average probability, $\bar{p} = d/(n+1)$, between two ordered samples $d$ units apart; however, for more than one new sample, the variance (8) is larger than that obtained by using the average parameter, $\bar{p}$, in the binomial distribution.
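A short check of this statement, as a sketch with illustrative names and values that are not from the original: for $N = 1$ the distribution (6) gives $1 - \bar{p}$ and $\bar{p}$, and dividing the variance (8) by the binomial variance $N\bar{p}(1-\bar{p})$ leaves the factor $(N+n+1)/(n+2)$, which equals one only when $N = 1$.

```python
from fractions import Fraction
from math import comb


def exceedance_pmf(x, N, n, d):
    """Averaged distribution (6), in exact rational arithmetic."""
    return Fraction(comb(N + n - d - x, n - d) * comb(d + x - 1, d - 1),
                    comb(N + n, n))


def variance_ratio(N, n, d):
    """Variance (8) divided by the binomial variance N*p*(1-p), p = d/(n+1)."""
    p = Fraction(d, n + 1)
    var_exceed = Fraction(N * d * (n - d + 1) * (N + n + 1),
                          (n + 1) ** 2 * (n + 2))
    return var_exceed / (N * p * (1 - p))


n, d = 20, 5  # illustrative values only
p_bar = Fraction(d, n + 1)
single = [exceedance_pmf(x, 1, n, d) for x in (0, 1)]
new_sample_sizes = (1, 2, 10, 100)
ratios = [variance_ratio(N, n, d) for N in new_sample_sizes]
print(single, ratios)
```

The ratio grows linearly in $N$, so the binomial approximation with the averaged parameter increasingly understates the spread as more names are added.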


In the usual applications, both samples are large, with:

$$\lim_{\substack{N \to \infty \\ n \to \infty}} \frac{N}{N+n} = f.$$

In this case:

$$w_x = \binom{d-1+x}{d-1}\,(1-f)^{d} f^{x}, \qquad x = 0, 1, 2, \ldots, \tag{10}$$

a special case of the negative binomial distribution.
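The quality of the limiting form (10) can be gauged by comparing it with the exact distribution (6) for large but finite samples. The sketch below uses hypothetical function names and illustrative parameter values; it reports the largest absolute difference between the two over the first thirty values of $x$.

```python
from math import comb


def exceedance_pmf(x, N, n, d):
    """Exact averaged distribution (6), returned as a float."""
    return comb(N + n - d - x, n - d) * comb(d + x - 1, d - 1) / comb(N + n, n)


def negative_binomial_pmf(x, d, f):
    """Large-sample form (10)."""
    return comb(d - 1 + x, d - 1) * (1 - f) ** d * f ** x


n, N, d = 3000, 1000, 5  # illustrative values only
f = N / (N + n)
max_gap = max(abs(exceedance_pmf(x, N, n, d) - negative_binomial_pmf(x, d, f))
              for x in range(30))
print(f, max_gap)
```

The approximation holds when $d$ and $x$ are small compared with $n$ and $N$; the gap widens if either is allowed to grow with the sample sizes.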

To answer the design problem posed at the beginning, we note that the averaged probability distribution is independent of the sampling distribution and of the original first or last name, and thus depends only on the number of names to be stored in a given drawer. Hence, to equalize the probability of overflow among drawers of the same capacity, the same number of empty positions should be left in each drawer, no matter which portion of the alphabet is contained in the drawer.

Suppose that a given drawer begins with index $t$, and the next drawer begins with $t+d$; this scheme will preserve the original initial entry in each drawer, and there will be $d$ names in each drawer to begin with. If the drawer has a capacity $c$, then the overflow probability for this drawer is:

$$P_0 = \sum_{x=c-d+1}^{N} w_x(N,n;0,d), \tag{11}$$
which for the approximation (10) can be written as:

$$P_0 \approx \binom{c}{d-1}\,(1-f)^{d} f^{\,c-d+1}\; {}_2F_1(c+1,\, 1;\; c-d+2;\; f). \tag{12}$$

If the fraction $f$ is small compared to the fraction of drawer space that is unused, the hypergeometric function may be approximated by unity.
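To see how the three levels of approximation compare, the following sketch evaluates the exact tail sum (11), the negative binomial tail implied by (10), and the leading term of (12) with the hypergeometric factor set to unity. The function names, the truncation at 200 terms, and the parameter values are assumptions made for illustration only.

```python
from math import comb


def overflow_exact(N, n, d, c):
    """Overflow probability (11): tail sum of (6) over x = c-d+1, ..., N."""
    tail = sum(comb(N + n - d - x, n - d) * comb(d + x - 1, d - 1)
               for x in range(c - d + 1, N + 1))
    return tail / comb(N + n, n)


def overflow_leading_term(f, d, c):
    """Approximation (12) with the hypergeometric factor replaced by unity."""
    return comb(c, d - 1) * (1 - f) ** d * f ** (c - d + 1)


def overflow_negative_binomial(f, d, c, terms=200):
    """Tail of (10), summed term by term; successive terms of (10) differ
    by the factor (d + x) / (x + 1) * f."""
    x = c - d + 1
    term = overflow_leading_term(f, d, c)
    total = 0.0
    for _ in range(terms):
        total += term
        term *= (d + x) / (x + 1) * f
        x += 1
    return total


n, N, d, c = 4000, 400, 20, 30  # illustrative values only
f = N / (N + n)
exact = overflow_exact(N, n, d, c)
leading = overflow_leading_term(f, d, c)
nb_tail = overflow_negative_binomial(f, d, c)
print(exact, nb_tail, leading)
```

The leading term is always a lower bound on the negative binomial tail, since every term of the hypergeometric series is positive; how close it comes depends on how small $f$ is relative to $(c-d+2)/(c+1)$.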

If one has a system of $r$ drawers, which begin with names of index $t_0, t_1, \ldots, t_{r-1}$ (the original first entries are preserved in each drawer, except that zero always starts the first drawer), then an analysis similar to the one given above will give the joint distribution of $x_0$ falling in the first drawer, $x_1$ falling in the second drawer, ..., $x_{r-1}$ falling in the $r$th drawer. We state without proof:


In order to calculate the probability that at least one of the $r$ drawers has overflowed, we must take the tail sum of (13).


